
Conversation

@ikawrakow
Contributor

Instead of evaluating context + ending for each of the 4 endings, we evaluate context + ending for the 1st ending, and then just the ending with n_past = context length for the remaining 3. This gives a disappointing ~10% speedup for the hellaswag_val_full.txt data (see #2321).

Initially I tried evaluating just the context first, and then running the 4 endings with n_past = context length, but this was ~10% slower.

The efficiency gain is much more significant for a few-shot HellaSwag evaluation, where one adds to the context additional examples from a training dataset. For instance, with 3 additional examples, this PR runs nearly 2X faster compared to master (I could not go beyond 3 because I'm running into a ggml_new_object: not enough space in the context's memory pool error despite using -c 1024 and the context being just 647 tokens when it fails).
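
For reference, here is a minimal sketch of the evaluation order described above, using the llama_eval()/llama_get_logits() API as it exists at this point. This is not the exact PR code; score handling is only indicated in comments.

```cpp
// Hedged sketch of the evaluation order described above — not the exact PR code.
#include "llama.h"
#include <vector>

static void evaluate_task(llama_context * ctx,
                          const std::vector<llama_token> & context_tokens,
                          const std::vector<std::vector<llama_token>> & endings,
                          int n_threads) {
    // 1st ending: evaluate context + ending in one go, which also fills the
    // KV cache with the context tokens
    std::vector<llama_token> query = context_tokens;
    query.insert(query.end(), endings[0].begin(), endings[0].end());
    llama_eval(ctx, query.data(), (int) query.size(), 0, n_threads);
    // ... score ending 0 from llama_get_logits(ctx) ...

    // Remaining endings: the context is already in the KV cache, so only the
    // ending tokens are evaluated, with n_past = context length
    const int n_past = (int) context_tokens.size();
    for (size_t i = 1; i < endings.size(); ++i) {
        llama_eval(ctx, endings[i].data(), (int) endings[i].size(), n_past, n_threads);
        // ... score ending i from llama_get_logits(ctx) ...
    }
}
```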

@ikawrakow ikawrakow requested a review from klosax August 20, 2023 07:39
@klosax
Contributor

klosax commented Aug 20, 2023

Tested with OpenLLaMA 3bv2-q8_0
The score seems to be much lower than on master. For reference, the HF final (10-shot) score is 71.6 with the F16 model.

[image: hellaswag_chart]

We might get a speedup and scores similar to master by putting a whole task in one ctx window and inserting BOS/EOS in between the full queries, something like [BOS] context+ending0 [EOS] [BOS] context+ending1 ... I think one whole task should fit in a ctx of 1024. If this works for one task, we could possibly include an additional task for even more speedup.
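
For illustration, a minimal sketch of that packing, with hypothetical bos/eos token values passed in by the caller (not an actual implementation):

```cpp
// Hedged sketch of the suggested layout: [BOS] context+ending0 [EOS] [BOS] context+ending1 [EOS] ...
#include "llama.h"
#include <vector>

static std::vector<llama_token> pack_task(const std::vector<std::vector<llama_token>> & queries,
                                          llama_token bos, llama_token eos) {
    std::vector<llama_token> packed;
    for (const auto & q : queries) {   // q = tokens of context + one ending
        packed.push_back(bos);
        packed.insert(packed.end(), q.begin(), q.end());
        packed.push_back(eos);
    }
    return packed;
}
```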

@ikawrakow
Contributor Author

The score seems to be much lower than on master. For reference, the HF final (10-shot) score is 71.6 with the F16 model.

Hm, not sure. I tested with LLaMA-2 7B and the results match 100% for the 800 tasks I checked. At 400 tasks, this PR and master are both 0.25 higher than what is posted in the table in #2321. My hypothesis is that this is perhaps due to eps_rmse (I'm using 5e-6 instead of 1e-5, as it gives a slightly lower perplexity).

@klosax
Contributor

klosax commented Aug 20, 2023

It may be something with the OpenLLaMA models. I will run a test using LLaMA-2 7B.

@ikawrakow
Contributor Author

ikawrakow commented Aug 20, 2023

Here is what I get with master and this PR using OpenLLaMA-3B and fp16. You only see one curve because the results of the two runs are identical.
[image: hs_ol_3B]

Here are the first 100 tasks on master:

ml-f16.bin -t 1 -ngl 100 --hellaswag-tasks 10042
main: build = 1007 (1f0bccb)
main: seed = 1692528110
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9
llama.cpp: loading model from ../models/ol_3B/ggml-f16.bin
llama_model_load_internal: format = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 3200
llama_model_load_internal: n_mult = 216
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_head_kv = 32
llama_model_load_internal: n_layer = 26
llama_model_load_internal: n_rot = 100
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 8640
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 1 (mostly F16)
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 455.38 MB (+ 162.50 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 288 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 26 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 29/29 layers to GPU
llama_model_load_internal: total VRAM used: 6791 MB
llama_new_context_with_model: kv self size = 162.50 MB

system_info: n_threads = 1 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
hellaswag_score : loaded 10042 tasks from prompt.
hellaswag_score : selecting 10042 randomized tasks.
hellaswag_score : calculating hellaswag score over selected tasks.

task acc_norm
1 100.00000000
2 50.00000000
3 66.66666667
4 75.00000000
5 80.00000000
6 83.33333333
7 71.42857143
8 62.50000000
9 55.55555556
10 50.00000000
11 45.45454545
12 41.66666667
13 38.46153846
14 35.71428571
15 40.00000000
16 43.75000000
17 41.17647059
18 44.44444444
19 47.36842105
20 50.00000000
21 52.38095238
22 50.00000000
23 52.17391304
24 50.00000000
25 52.00000000
26 53.84615385
27 51.85185185
28 50.00000000
29 51.72413793
30 53.33333333
31 54.83870968
32 56.25000000
33 54.54545455
34 52.94117647
35 54.28571429
36 55.55555556
37 54.05405405
38 55.26315789
39 56.41025641
40 55.00000000
41 56.09756098
42 57.14285714
43 58.13953488
44 59.09090909
45 57.77777778
46 56.52173913
47 57.44680851
48 58.33333333
49 59.18367347
50 60.00000000
51 60.78431373
52 61.53846154
53 60.37735849
54 61.11111111
55 60.00000000
56 58.92857143
57 59.64912281
58 60.34482759
59 61.01694915
60 60.00000000
61 60.65573770
62 61.29032258
63 61.90476190
64 60.93750000
65 60.00000000
66 60.60606061
67 61.19402985
68 60.29411765
69 60.86956522
70 61.42857143
71 61.97183099
72 62.50000000
73 61.64383562
74 60.81081081
75 60.00000000
76 60.52631579
77 59.74025974
78 58.97435897
79 59.49367089
80 60.00000000
81 59.25925926
82 59.75609756
83 59.03614458
84 59.52380952
85 58.82352941
86 59.30232558
87 59.77011494
88 60.22727273
89 60.67415730
90 61.11111111
91 61.53846154
92 60.86956522
93 61.29032258
94 61.70212766
95 62.10526316
96 62.50000000
97 61.85567010
98 61.22448980
99 61.61616162
100 61.00000000

@klosax
Contributor

klosax commented Aug 20, 2023

Yes, my bad. I was comparing the PR against a run I did back when the HellaSwag score was first implemented. This means that some changes since then have lowered the scores that much!

@ikawrakow
Contributor Author

And here is a quick comparison between 0-shot and 3-shot HellaSwag using OpenLLaMA-3B and fp16. 3-shot is 1.375 higher after 800 tasks (but the statistical uncertainty of this estimate, ~0.9, is still much too high to tell for sure by how much 3-shot improves over 0-shot).
[image: hs_ol_3B_3shot]

@klosax
Contributor

klosax commented Aug 20, 2023

And here is a quick comparison between 0-shot and 3-shot

How does the 3-shot work? By using EOS/BOS in between the example tasks?

@ikawrakow
Contributor Author

And here is a quick comparison between 0-shot and 3-shot

How does the 3-shot work? By using EOS/BOS in between the example tasks?

I tried a bunch of different stuff. The best approach so far is illustrated by the first task of the dataset:

Roof shingle removal: <item>A screen appears with a layout of a home and a lot of black, white and red words that say "tool box buzz news, reviews and information product reviews by industry expert todd fratzel". a picture of a tool appears on a red screen and to the right there are black words that say " ridgid model r040sca roofing cutter ".</item><item>This video allows viewers to hear the testimony of a customer who has used the pro roofing service. First the man tells how he heard about the service and how great it is. he also says that he chose pro roofing because the neighbors chose it and persuaded him by telling how great the service is.</item><item>A white screen appears with a black brand name that say's "the shingle hog" and it has a picture of a hog in the o of the word hog. a quick picture of a red and black tool is shown, then a man is talking and clips of people removing shingles play in between moments shown when the man talks.</item><item>A man is sitting on a roof. he
3
is using wrap to wrap a pair of skis.</item>
is ripping level tiles off.</item> 
is holding a rubik's cube.</item>
starts pulling up roofing on a roof.</item>

The first 3 query+response pairs, surrounded by <item>...</item>, are taken from the HellaSwag training dataset; the last is the actual task query from the validation dataset. Without some kind of separator between the sentences, the score does not improve. This kind of makes sense, as the sentences are not necessarily logically associated with each other, even if they come from the same activity group. I tried a few variations (e.g., <p>, <li>, etc., making a JSON of the form "query": "response", or using [context]: [response]:, etc.), and the above version produces the highest score on 7B LLaMA-2 (0.95 higher than 0-shot after 10000 tasks).
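
For concreteness, a minimal sketch of how such a few-shot context could be assembled (the function and variable names are made up for illustration, not those used in the PR):

```cpp
// Hedged sketch of the prompt layout described above — illustrative only.
// 'examples' would hold query+response strings drawn from the HellaSwag
// training dataset; 'task_context' is the context of the validation task.
#include <string>
#include <vector>

static std::string build_fewshot_context(const std::vector<std::string> & examples,
                                         const std::string & task_context) {
    std::string prompt;
    for (const auto & ex : examples) {
        // each training query+response is wrapped in <item>...</item>
        prompt += "<item>" + ex + "</item>";
    }
    // the actual task context follows; each of the 4 candidate endings is
    // appended (and closed with </item>) when it is scored
    prompt += "<item>" + task_context;
    return prompt;
}
```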

@slaren
Member

slaren commented Aug 20, 2023

I could not go beyond 3 because I'm running into a ggml_new_object: not enough space in the context's memory pool error despite using -c 1024 and the context being just 647 tokens when it fails

This could happen if you try to pass more tokens than n_batch to llama_eval. If you need to evaluate more than 512 tokens, you would have to split the input into multiple calls to llama_eval.
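
For illustration, a minimal sketch of that splitting, assuming the llama_eval() signature of this period (tokens, n_tokens, n_past, n_threads); this mirrors the usual chunking pattern but is not taken from the PR:

```cpp
// Hedged sketch: feed a long token sequence to llama_eval() in n_batch-sized
// chunks, advancing n_past after each call.
#include "llama.h"
#include <algorithm>
#include <vector>

static int eval_tokens(llama_context * ctx, const std::vector<llama_token> & tokens,
                       int n_past, int n_batch, int n_threads) {
    for (size_t i = 0; i < tokens.size(); i += n_batch) {
        const int n_eval = (int) std::min(tokens.size() - i, (size_t) n_batch);
        if (llama_eval(ctx, tokens.data() + i, n_eval, n_past, n_threads) != 0) {
            return 1; // evaluation failed
        }
        n_past += n_eval;
    }
    return 0;
}
```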

@ikawrakow
Contributor Author

I could not go beyond 3 because I'm running into a ggml_new_object: not enough space in the context's memory pool error despite using -c 1024 and the context being just 647 tokens when it fails

This could happen if you try to pass more tokens than n_batch to llama_eval.

In other words, I need to split into batches if the number of tokens is greater than n_batch? Or can one now use n_batch > 512 without running into issues (as I did the last time I tried that)?

@klosax
Contributor

klosax commented Aug 20, 2023

Set n_batch = n_ctx if this is the problem.

@slaren
Member

slaren commented Aug 20, 2023

I need to split into batches if the number of tokens is greater than n_batch

Yes, that's correct.

Or can one now use n_batch > 512 without running into issues (as I did the last time I tried that)?

This looks like CUDA, which still uses the scratch buffers, and those have a maximum batch size of 512. With the other backends, you should be able to use any batch size. However, keep in mind that the command-line parsing in common.cpp caps the batch size to 512 regardless, so while it may be possible to initialize llama.cpp with a higher batch size on some backends, you need to set it manually in llama_context_params.
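
As a hedged sketch of what setting it manually could look like with the llama.cpp API of this period (the model path is a placeholder):

```cpp
// Hedged sketch: set a batch size larger than the 512 allowed by the -b
// command line option directly in llama_context_params. Only useful on
// backends that do not have the 512-token scratch-buffer limit.
#include "llama.h"

static llama_context * make_large_batch_context() {
    llama_context_params params = llama_context_default_params();
    params.n_ctx   = 1024;
    params.n_batch = 1024;  // bypasses the common.cpp cap by setting it programmatically

    // "model.bin" is a placeholder path
    llama_model   * model = llama_load_model_from_file("model.bin", params);
    llama_context * ctx   = llama_new_context_with_model(model, params);
    return ctx;
}
```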

@ikawrakow
Contributor Author

@slaren Yes, I'm running on CUDA (doing HellaSwag on the CPU is a hopeless undertaking). But is it really the CUDA backend? The error is raised in ggml_new_object(), which is called from ggml_new_tensor_2d() before anything even reaches the CUDA backend.

@klosax
Contributor

klosax commented Aug 20, 2023

The first 3 query+response pairs, surrounded by <item>...</item>, are taken from the HellaSwag training dataset,

Are you saying you do the evaluation by first "steering" the model using the training dataset?

@ikawrakow
Contributor Author

Are you saying you do the evaluation by first "steering" the model using the training dataset?

If this is how you want to see it, sure. The way I see it, this is basically my interpretation of what few-shot evaluation is.

@slaren
Member

slaren commented Aug 20, 2023

The error is raised in ggml_new_object(), which is called from ggml_new_tensor_2d() before anything even reaches the CUDA backend.

So the problem is that due to some limitations, the CUDA backend still uses ggml scratch buffers for memory allocation, while the other backends have already moved to the graph allocator. The size of the scratch buffers was determined manually for a batch size of 512, while the graph allocator calculates the size of buffers dynamically and therefore can use any batch size. When using the graph allocator, the compute context is no_alloc, which means that tensor data is not allocated during calls such as ggml_new_tensor_2d(), but rather it is done later in a call to ggml_allocr_alloc_graph.
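
A heavily simplified sketch of that flow, assuming the ggml/ggml-alloc API of this period (the tensors, sizes, and alignment are made up for illustration; this is not the actual llama.cpp evaluation code):

```cpp
// Hedged sketch of the no_alloc + graph-allocator flow described above.
#include "ggml.h"
#include "ggml-alloc.h"
#include <cstdint>
#include <vector>

static void graph_alloc_example() {
    // With the graph allocator the compute context is no_alloc: calls like
    // ggml_new_tensor_2d() only create tensor metadata here, no tensor data.
    std::vector<uint8_t> meta_buf(16u * 1024 * 1024);  // metadata only; size illustrative
    struct ggml_init_params params = {
        /*.mem_size   =*/ meta_buf.size(),
        /*.mem_buffer =*/ meta_buf.data(),
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx0 = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, 4096, 512);
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, 4096, 512);
    struct ggml_tensor * c = ggml_mul_mat(ctx0, a, b);

    struct ggml_cgraph gf = ggml_build_forward(c);

    // Tensor data for the whole graph is assigned here, in one pass, so the
    // required buffer size follows the actual graph (and hence batch size)
    // instead of a hand-tuned scratch-buffer size.
    struct ggml_allocr * allocr = ggml_allocr_new_measure(32 /* alignment */);
    size_t mem_needed = ggml_allocr_alloc_graph(allocr, &gf);
    (void) mem_needed;

    ggml_allocr_free(allocr);
    ggml_free(ctx0);
}
```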

@klosax
Contributor

klosax commented Aug 20, 2023

Are you saying you do the evaluation by first "steering" the model using the training dataset?

If this is how you want to see it, sure. The way I see it, this is basically my interpretation of what few-shot evaluation is.

Yes, but I do not think the training dataset should be used at all when testing. Have you tested using only data from the validation dataset?

@ikawrakow
Contributor Author

So the problem is that due to some limitations, the CUDA backend still uses ggml scratch buffers for memory allocation

So, I guess the long-term solution would be to move the CUDA backend to the graph allocator. But what is a quick fix? Just play around with MEM_REQ_SCRATCH0 and MEM_REQ_SCRATCH1 until it runs? Or set n_batch and hope for the best? As far as I can tell, increasing n_batch will not help, as it runs out of space in the second scratch buffer, which does not depend on the context or batch size.

@slaren
Member

slaren commented Aug 20, 2023

But what is a quick fix? Just play around with MEM_REQ_SCRATCH0 and MEM_REQ_SCRATCH1 until it runs?

That should work. However, I think the proper solution would be to respect n_batch in the perplexity tool and do the evaluation in multiple calls if needed.

@ikawrakow ikawrakow merged commit 5e9ff54 into master Aug 20, 2023
@ikawrakow ikawrakow deleted the ik/hellaswag branch August 20, 2023 13:44
@ikawrakow
Contributor Author

I noticed that I had not downloaded OpenLLaMA-v2, so the above graph is for OpenLLaMA-v1. For the sake of completeness, here is a graph showing HellaSwag scores for OpenLLaMA-v2-3B, this time for all 10042 tasks. 3-shot ends with a score of 70.8, which is not too far from the 10-shot 71.6 result posted on HF.
[image: hs_ol_3B_3shot1]

@klosax
Contributor

klosax commented Aug 20, 2023

Maybe test this PR in the GGUF branch. The perplexity is slightly lower with the changes there, so HellaSwag scores could be higher.
